feat: per-session pipelining via seq-ordered data ops #903
dazzling-no-more wants to merge 1 commit into therealaleph:main
Conversation
Reviewed via Anthropic Claude. Verified locally on top of v1.9.18.
Wire-protocol design looks careful. I'm hesitating on an immediate merge because of two factors:
Plan: leaving open for 5–7 days of community testing like we did with #359 (drive queue). If two or three users on different ISPs report it stable for streaming downloads + heavy concurrent browsing, I'll squash-merge as v1.9.19. For testers reading this: pull
Thanks @dazzling-no-more — this is the kind of structural perf work that's hard to land safely. The seq-lock + capability-negotiation design is exactly the right architecture for what is ultimately a not-quite-ordered transport layer.
Summary
Per-session pipelining for full-tunnel mode. Allows up to 2 in-flight `data` ops per session to overlap one Apps Script round-trip with the next on a single TCP stream, lifting the per-session throughput cap that came from strict request/await/request sequencing.

Wire protocol (forward-compatible — old peers ignore the new fields):
- `BatchOp.seq: Option<u64>` — per-session monotonic sequence number, sent only on `data` ops in pipelined sessions.
- `TunnelResponse.seq: Option<u64>` — echoed by the server.
- `TunnelResponse.caps: Option<u32>` — advertised on `connect` / `connect_data` success replies. Bit 0 = `CAPS_PIPELINE_SEQ`.
- `seq` is `u64` rather than `u32` so a long-lived TCP session generating ~100 ops/s doesn't saturate after ~1.4 years.
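For concreteness, a minimal sketch of what these additions could look like on the Rust side, assuming serde-derived protocol structs as elsewhere in the codebase; the surrounding fields are elided and the exact layout here is illustrative, not copied from the diff:

```rust
use serde::{Deserialize, Serialize};

/// Bit 0 of `TunnelResponse.caps`: the peer understands seq-ordered pipelining.
pub const CAPS_PIPELINE_SEQ: u32 = 1 << 0;

#[derive(Serialize, Deserialize)]
pub struct BatchOp {
    // ...existing op fields (kind, sid, payload, ...) elided...
    /// Per-session monotonic sequence number; set only on `data` ops in
    /// pipelined sessions. Old peers never emit it and ignore it on receipt.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub seq: Option<u64>,
}

#[derive(Serialize, Deserialize)]
pub struct TunnelResponse {
    // ...existing reply fields elided...
    /// Echo of the request's `seq`, if any.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub seq: Option<u64>,
    /// Capability bits advertised on `connect` / `connect_data` success replies.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub caps: Option<u32>,
}
```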
Tunnel-node (new):
- When a `data` op carries `seq`, the server processes the entire (write, drain) sequence under a per-session seq lock and only releases it after `expected` advances. Two ops can travel through different deployments and arrive out of order — the seq lock guarantees they're processed (and replied to) in send order, so a client reorder buffer never sees bytes from seq=N+1 land before seq=N's bytes.
- Old `data` ops without `seq` continue down the legacy unordered path.
- `connect` / `connect_data` success replies set `caps: Some(CAPS_PIPELINE_SEQ)` — old clients ignore the unknown field, new clients opt into pipelining.
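A rough sketch of the per-session ordering gate, using a tokio `watch` channel in place of whatever lock/notify combination the PR actually uses; `SEQ_WAIT_TIMEOUT` and the `expected` counter come from the PR text, while `SeqGate` and the exact `wait_for_seq_turn` signature are assumptions for illustration:

```rust
use std::time::Duration;
use tokio::sync::watch;

pub const SEQ_WAIT_TIMEOUT: Duration = Duration::from_secs(30);

/// Per-session gate: the op carrying sequence number N may run its
/// (write, drain) phase only once `expected` has reached N.
pub struct SeqGate {
    expected_tx: watch::Sender<u64>,
}

impl SeqGate {
    pub fn new() -> Self {
        Self { expected_tx: watch::channel(0).0 }
    }

    /// Block until it is this op's turn, or give up after SEQ_WAIT_TIMEOUT
    /// (e.g. an earlier seq was lost along with a dropped batch).
    pub async fn wait_for_seq_turn(&self, seq: u64) -> bool {
        let mut rx = self.expected_tx.subscribe();
        let turn = rx.wait_for(|&expected| expected >= seq);
        matches!(tokio::time::timeout(SEQ_WAIT_TIMEOUT, turn).await, Ok(Ok(_)))
    }

    /// Called once the op's drain finishes, and also from the write-failure
    /// path (the SeqAdvanceOnDrop idea), so a failed op never wedges later seqs.
    pub fn advance(&self, seq: u64) {
        self.expected_tx.send_replace(seq + 1);
    }
}
```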
Client (new):
- `tunnel_loop_pipelined` keeps up to `PIPELINE_DEPTH = 2` `data` ops in flight per pipelined session.
- Reordering is implicit — a `VecDeque<(seq, oneshot::Receiver)>` awaited in FIFO order means replies for later seqs sit in their oneshot buffers until our task gets to them.
- Idle sessions stay at depth=1 (server long-polls; no speculative quota burn).
- Pipelining only enables when the connect reply advertised `caps` AND no prior reply has been observed dropping the `seq` field (mixed-version backend protection).
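A compressed view of that FIFO-await pattern, assuming a reply type that carries the echoed `seq` and the downlink bytes; the real `tunnel_loop_pipelined` interleaves this with uplink reads, retries and close handling, none of which is shown here:

```rust
use std::collections::VecDeque;
use tokio::sync::oneshot;

const PIPELINE_DEPTH: usize = 2; // at most two data ops in flight per session

/// Stand-in for the fields of TunnelResponse this loop cares about.
struct Reply {
    seq: u64,
    bytes: Vec<u8>,
}

/// Replies are awaited strictly in send order: if the reply for seq=N+1
/// arrives first it simply sits in its oneshot buffer until seq=N has been
/// consumed, so bytes written back to the local TCP socket stay in order.
async fn drain_until_below_depth(
    in_flight: &mut VecDeque<(u64, oneshot::Receiver<Reply>)>,
    write_downlink: &mut impl FnMut(&[u8]),
) {
    while in_flight.len() >= PIPELINE_DEPTH {
        let (seq, rx) = in_flight.pop_front().expect("len checked above");
        if let Ok(reply) = rx.await {
            debug_assert_eq!(reply.seq, seq); // server echoes the seq it processed
            write_downlink(&reply.bytes);
        }
    }
}
```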
Cross-cutting correctness:
- A `BatchWait` primitive replaces N×M per-job watchers with one watcher per unique `SessionInner` (deduplicated by `Arc::as_ptr`). `reader_task` switched to `notify_waiters()` so concurrent batches all wake on each push.
- `pipelined_reply_timeout` is derived as `effective_batch * 2 + slack` — covers the permit-saturated semaphore wait plus the op's own request budget.
- Ops with `had_uplink == true` wait per-session up to `ACTIVE_DRAIN_DEADLINE` and run a straggler-settle loop, so multi-packet upstream responses land in one drain instead of being split across batches. Idle ops still subscribe to the shared `BatchWait` for cross-session wake.
- `close` ops are collected during dispatch and processed AFTER seq jobs ONLY when an earlier same-sid op is already deferred (else run inline) — preserves both `[data(seq), close]` (data writes before close tears the session down) AND `[close, data]` (close runs first, data hits the closed session).
- A `client_send_closed` flag in `tunnel_loop_pipelined` so a TCP half-close from the local client stops queuing new ops but keeps draining already-queued reply oneshots — a valid HTTP request/response with client-side shutdown no longer drops downlink bytes.
- Timeouts are plumbed through `tunnel_batch_request_with_timeout` (the 3-arg `tunnel_batch_request_to` is kept as a backward-compat wrapper) so both h2 (`h2_relay_request`) and h1 (`read_http_response_with_timeout`) honor the pipelined-batch floor — avoids the inner 30 s default firing before our outer 60 s budget on user-tuned configs.
- Phase 2's drain is capped at `TCP_DRAIN_MAX_BYTES` per iteration (mirrors the seq path), so concurrent seq jobs continue to see headroom while that drain is in flight.
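For the first bullet, a sketch of the dedup-by-pointer watcher idea. `SessionInner` is reduced here to just the `Notify` that `reader_task` hits with `notify_waiters()`, and `futures::future::select_all` is my stand-in for "wake on whichever session fires first", not necessarily what `BatchWait` does internally:

```rust
use std::collections::HashSet;
use std::sync::Arc;
use tokio::sync::Notify;

/// Reduced stand-in for the PR's SessionInner: reader_task calls
/// `pushed.notify_waiters()` every time it buffers new downstream data.
pub struct SessionInner {
    pub pushed: Notify,
}

/// One watcher per unique session referenced by the batch (deduplicated by
/// Arc pointer identity) instead of one watcher per (job, session) pair.
pub async fn batch_wait(sessions: &[Arc<SessionInner>]) {
    let mut seen = HashSet::new();
    let waiters: Vec<_> = sessions
        .iter()
        .filter(|s| seen.insert(Arc::as_ptr(s)))  // dedupe by pointer identity
        .map(|s| Box::pin(s.pushed.notified()))   // one waiter each (registered on first poll)
        .collect();
    if waiters.is_empty() {
        return;
    }
    // Resolves as soon as any session's reader_task pushes; notify_waiters()
    // wakes every registered waiter, so concurrent batches all observe it.
    futures::future::select_all(waiters).await;
}
```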
Public API change
- `TunnelMux::udp_open` and `udp_data` already take `impl Into<Bytes>` (carried over from the prior zero-copy PR).
- `tunnel_batch_request_to` keeps its 3-arg signature; the new `tunnel_batch_request_with_timeout` exposes the explicit timeout for callers that need it.
- Public structs `TunnelResponse` and `BatchOp` gain new optional fields with `#[serde(default)]` / `#[serde(skip_serializing_if = "Option::is_none")]` — old wire shapes deserialize cleanly and serialize identically.
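To make the compatibility story concrete, a sketch of the wrapper relationship between the two entry points. The parameter list, the 30 s default and the transport stub are placeholders (the real signatures live in the client crate); only the "old name delegates to the new name with a default budget" shape is the point:

```rust
use std::io;
use std::time::Duration;
use tokio::time::timeout;

type Batch = Vec<u8>;
type BatchReply = Vec<u8>;

const DEFAULT_BATCH_TIMEOUT: Duration = Duration::from_secs(30);

// Placeholder for the actual HTTP round trip to the Apps Script backend.
async fn send_batch(url: &str, batch: Batch, attempt: u32) -> io::Result<BatchReply> {
    let _ = (url, batch, attempt);
    Ok(Vec::new())
}

/// New entry point: pipelined callers pass their own (larger) budget so an
/// inner default timeout can never fire before the outer pipelined-batch floor.
pub async fn tunnel_batch_request_with_timeout(
    url: &str,
    batch: Batch,
    attempt: u32,
    budget: Duration,
) -> io::Result<BatchReply> {
    timeout(budget, send_batch(url, batch, attempt))
        .await
        .map_err(|_| io::Error::new(io::ErrorKind::TimedOut, "batch request timed out"))?
}

/// Backward-compat wrapper: existing 3-arg call sites keep compiling and
/// keep the old default budget.
pub async fn tunnel_batch_request_to(url: &str, batch: Batch, attempt: u32) -> io::Result<BatchReply> {
    tunnel_batch_request_with_timeout(url, batch, attempt, DEFAULT_BATCH_TIMEOUT).await
}
```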
Realistic perf delta
For active streaming downloads on a single TCP session, expect a 30–80% throughput improvement depending on what fraction of the Apps Script RTT is server-side vs network. Less for idle / long-poll workloads (server-side seq enforcement serializes drains per session). No win on multi-session workloads (those were already parallel across sessions). Documented latency trade-off: a stuck seq op (a lost earlier seq from a dropped batch) holds the batch HTTP response open for up to `SEQ_WAIT_TIMEOUT = 30 s`; the pinned regression test `unrelated_seq_session_in_same_batch_is_not_delayed_past_seq_wait` ensures unrelated sessions' intrinsic processing time stays sub-second within that window.
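As an illustrative calculation (these figures are made up for the arithmetic, not measurements): if one op costs 600 ms end to end, 400 ms of which is server-side work that the per-session seq lock still serializes, strict request/await/request moves one op per 600 ms, while two overlapped ops are bounded by max(600/2, 400) = 400 ms per op, roughly a 50% gain. Shrink the server-side share and the depth-2 bound rises toward a full 2x; grow it and the gain falls toward zero.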
Test plan
- `cargo build --bins --lib` clean on both crates
- `cargo test --lib` (client) + `cargo test` (tunnel-node) pass — 216 + 57 = 273 total
- `wait_for_seq_turn`, write-failure path bumps `expected` (`SeqAdvanceOnDrop`), tail preservation under cap, batch budget shared with seq jobs, EOF withholding while buffer non-empty, `[data(seq), close]` runs data before close, `[close, data]` runs close first
- `BatchWait` wakes all jobs on a single push, dedupes watchers per unique inner, wakes across concurrent batches on the same session, mixed ready+idle batch latency, Phase 2 budget reservation capped at `TCP_DRAIN_MAX_BYTES`
- `seq=0` first, in-seq-order output when replies arrive reversed, closes on seq mismatch, closes on `seq=None` (mixed-version detection), drains in-flight on client half-close, `pipelined_reply_timeout` covers `2× batch_timeout`, `mark_pipelining_disabled` sticky and overrides caps
- (`connect_data` capability detection)
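For the "in-seq-order output when replies arrive reversed" case, the shape of such a test written against the plain `VecDeque` + `oneshot` pattern from the client sketch above, rather than the PR's actual test harness:

```rust
#[tokio::test]
async fn reversed_replies_still_drain_in_seq_order() {
    use std::collections::VecDeque;
    use tokio::sync::oneshot;

    let (tx0, rx0) = oneshot::channel::<Vec<u8>>();
    let (tx1, rx1) = oneshot::channel::<Vec<u8>>();
    let mut in_flight: VecDeque<(u64, oneshot::Receiver<Vec<u8>>)> =
        VecDeque::from([(0, rx0), (1, rx1)]);

    // The reply for seq=1 lands first; it just waits in its oneshot buffer.
    tx1.send(b"second".to_vec()).unwrap();
    tx0.send(b"first".to_vec()).unwrap();

    // Draining strictly FIFO by seq yields the bytes in send order regardless.
    let mut out = Vec::new();
    while let Some((_seq, rx)) = in_flight.pop_front() {
        out.push(rx.await.unwrap());
    }
    assert_eq!(out, vec![b"first".to_vec(), b"second".to_vec()]);
}
```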